
1 Inclusion and exclusion criteria

Definitions:

*Cancer diagnosis information was taken from HES and coded using ICD-10 and ICD-9 codes. Additional cancer information at baseline was taken from self-reported cancers.

2 Composite scores

Composite scores were computed for Physical activity, Total meat intake, Red meat intake, White meat intake and Total fruit/veg intake.

Physical activity:

Total fruit/veg intake:

Total meat intake:

PCA was performed for numerical variables (physical activity and total fruit/veg intake). The first component was used as the composite score.
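As an illustration of using the first principal component as a composite score, here is a minimal Python sketch. It assumes the columns are standardised before PCA and finds the first component by power iteration. The function name `first_pc_scores` and the data are hypothetical, and this is not the code used for the report.

```python
from statistics import mean, stdev


def first_pc_scores(rows, n_iter=200):
    """Return each row's score on the first principal component."""
    n, p = len(rows), len(rows[0])
    # Standardise every column so that variables on different scales
    # contribute equally (an assumption, not stated in the report).
    z_cols = []
    for col in zip(*rows):
        m, s = mean(col), stdev(col)
        z_cols.append([(v - m) / s for v in col])
    z = list(zip(*z_cols))  # back to one tuple per participant
    # Sample covariance matrix of the standardised data (p x p).
    cov = [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration: repeated multiplication by cov converges to the
    # leading eigenvector, i.e. the loadings of the first component.
    w = [1.0] + [0.0] * (p - 1)
    for _ in range(n_iter):
        w = [sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    return [sum(r[a] * w[a] for a in range(p)) for r in z]


# Made-up data: three physical activity measures for five participants.
activity = [
    [10, 30, 2],
    [20, 45, 3],
    [35, 60, 5],
    [50, 90, 6],
    [60, 100, 8],
]
scores = first_pc_scores(activity)
print([round(s, 2) for s in scores])
```

The sign of a principal component is arbitrary, so it is worth checking that higher scores correspond to higher activity before interpreting the composite.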

Multiple Correspondence Analysis (MCA) was performed for categorical variables (meat intake). The first component was used as the composite score.

3 Table 1

Student’s t-test was used to compare continuous variables, and the chi-squared test was used to compare categorical variables. There is a separate table for biomarkers, and a further table, intended for the supplementary material, covers variables such as the composite scores described above. Any p values <0.001 are not printed, but p values below the Bonferroni threshold are shown in bold.
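The display rule for p values can be sketched as below. This is a minimal Python sketch that makes three assumptions: the Bonferroni threshold is 0.05 divided by the number of tests, values below 0.001 are shown as "<0.001" instead of in full, and bold is marked with double asterisks. The function name `format_p` is hypothetical.

```python
def format_p(p, n_tests, alpha=0.05):
    """Format one p value for display in a Table 1 style summary."""
    threshold = alpha / n_tests  # Bonferroni-corrected threshold
    # Assumption: values below 0.001 are shown as "<0.001", not in full.
    text = "<0.001" if p < 0.001 else f"{p:.3f}"
    # Assumption: bold is marked with markdown-style double asterisks.
    return f"**{text}**" if p < threshold else text


# Made-up p values, formatted as if 10 tests had been run.
p_values = [0.20, 0.004, 0.0004]
print([format_p(p, n_tests=10) for p in p_values])
```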


4 Univariate Analysis

Continuous variables were scaled prior to running logistic regression models to help visualize confidence intervals.
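A minimal Python sketch of the scaling step, assuming that "scaled" means centring each variable to mean 0 and dividing by its sample standard deviation. The report does not say which scaling was used, and the function name and data are illustrative.

```python
from statistics import mean, stdev


def standardise(values):
    """Centre to mean 0 and scale to unit sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample SD (n - 1 denominator)
    return [(v - m) / s for v in values]


bmi = [22.0, 25.0, 31.0, 27.0, 20.0]  # made-up values for illustration
z = standardise(bmi)
print([round(v, 3) for v in z])
```

With this scaling, each odds ratio is expressed per one standard deviation increase, which puts the estimates for different variables on a comparable scale in a forest plot.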

[Figure: Manhattan plot]

Significant for both models:

Sociodemographic:

Health risk:

Environment:

Medical:

Biomarkers:

[Figure: Manhattan plot]

Significant for both models: diabetes, HDL cholesterol.

[Figure: P Values]

A snapshot of all relevant p values. In this limited graph, we see that diabetes is significant for both models when not adjusted for smoking.

[Figure: P Values]

Diabetes, HDL cholesterol, and education:

[Figure: P Values]

All variables significant for lung cancer are marked. We see that a number are significant for lung cancer but not for bladder cancer.

[Figure: P Values]

Adjusted for smoking, only HDL cholesterol is still significant. Diabetes is no longer significant for both models.

[Figure: P Values]

We see that diabetes is significant for bladder cancer but not for lung cancer when adjusted for smoking.

[Figure: P Values]

Here there are far fewer significant markers; notably, diabetes and education are less significant. The scale of the graph is also much smaller, meaning that some of the effect has been removed by adjusting for smoking. The significance of deprivation indicators has halved.

[Figure: Forest]

Some ORs go off the scale here, or their CIs cannot be seen. This is due to the widely varying widths of the confidence intervals. I tried free_y, but it was a disaster.

The ORs that go off scale are connected to smoking status.

[Figure: Forest Zoom out]

This forest plot is zoomed out. We lose information visually on the CIs of the ORs, but we can still see the sharp contrast of the smoking variables in the middle, and how sociodemographic variables are reduced in effect when we control for smoking.

[Figure: Forest Plotrix]

Here I’m just playing around, overlaying all of the models on one graph. What we can see is that the models that cluster together are usually the ones adjusted for smoking, rather than clustering by cancer type.

-> How can we investigate this further?

[Figure: Forest Plotrix]

This looks terrible and I’m stuck on figuring out how to plot vertical confidence intervals.

-> Any advice?

5 Sensitivity Analysis by Age at Diagnosis and Time to Diagnosis

What I did:

[Figures: Age at Diagnosis Analysis]

[Figures: Time to Diagnosis Analysis]

6 LASSO Logistic Regression

Models:

Datasets were denoised using linear regression for continuous variables and logistic regression for categorical variables. One-hot encoding was used for categorical variables with more than two levels.
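One common reading of denoising a continuous variable is to regress it on the confounders and keep the residuals. The Python sketch below illustrates that reading with a single confounder and ordinary least squares. It is an assumption about the procedure, not the report's code, it does not cover the logistic case for categorical variables, and the function name `residualise` and the data are made up.

```python
from statistics import mean


def residualise(y, x):
    """Return the residuals of y after a simple least-squares fit on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # What is left of y once the linear effect of x has been removed.
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


age = [40, 50, 60, 70]        # hypothetical confounder
crp = [1.0, 1.5, 2.6, 2.9]    # hypothetical biomarker values
denoised = residualise(crp, age)
print([round(r, 3) for r in denoised])
```

By construction the residuals are uncorrelated with the confounder, so any remaining association with the outcome is not explained by a linear effect of that confounder.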

Additionally, models with forced confounders were run to check for any bias in the denoised datasets.

Four models were run for each outcome (lung/bladder cancer):

6.1 Denoised

6.1.1 Calibration

Lung

[Figure: Lung: Calibration of lambda and pi (Top = base model; Bottom = model adjusted for smoking)]

Bladder

[Figure: Bladder: Calibration of lambda and pi (Top = base model; Bottom = model adjusted for smoking)]

6.1.2 Mean Odds Ratios

Lung

[Figure: Lung: Mean Odds Ratio]

Bladder

[Figure: Bladder: Mean Odds Ratio]

6.1.3 Selection Proportion

Dashed red line: \(\text{threshold} = \max(\hat{\pi}_{\text{base}}, \hat{\pi}_{\text{adjusted}})\)
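The threshold rule can be illustrated with a minimal Python sketch. It assumes that the selection proportion of a variable is the fraction of subsamples in which it was selected, and that a variable counts as stably selected when its proportion is at or above the threshold. The report does not say whether the comparison is strict, and the function names and data are hypothetical.

```python
def selection_proportions(selected):
    """selected[i][j] is True if variable j was selected in subsample i."""
    n_runs = len(selected)
    n_vars = len(selected[0])
    return [sum(run[j] for run in selected) / n_runs for j in range(n_vars)]


def stably_selected(props, pi_base, pi_adjusted):
    """Flag variables whose proportion reaches the larger calibrated pi."""
    threshold = max(pi_base, pi_adjusted)
    # Assumption: ">=" is used; the original analysis may use ">".
    return [p >= threshold for p in props]


# Made-up selection indicators: four subsamples, three variables.
runs = [
    [True, True, False],
    [True, False, False],
    [True, True, False],
    [True, True, True],
]
props = selection_proportions(runs)
flags = stably_selected(props, pi_base=0.7, pi_adjusted=0.9)
print(props, flags)
```

Using the larger of the two calibrated values applies the same cut-off to the base and adjusted models, so their selection proportion plots can be compared directly.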

Lung

[Figure: Lung: Selection Proportion]

Bladder

[Figure: Bladder: Selection Proportion]

Checking consistency in sign of the beta coefficients for the variables with high selprop
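One way to quantify sign consistency is the fraction of non-zero coefficient estimates, across subsamples, that share the majority sign. The Python sketch below shows this for a single variable. It is an illustrative assumption about how the check could be done, not the report's method, and the function name `sign_consistency` and the coefficients are made up.

```python
def sign_consistency(betas):
    """Fraction of non-zero coefficients that share the majority sign."""
    nonzero = [b for b in betas if b != 0]
    if not nonzero:
        return None  # variable never selected, so no sign to check
    positive = sum(b > 0 for b in nonzero)
    return max(positive, len(nonzero) - positive) / len(nonzero)


# Made-up LASSO coefficients for one variable over five subsamples;
# a zero means the variable was not selected in that subsample.
betas = [0.2, 0.0, 0.3, -0.1, 0.25]
print(sign_consistency(betas))
```

A value close to 1 means the direction of the association is stable across subsamples, while a value near 0.5 means the sign flips about as often as not.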

Lung

[Figure: Lung: Base model (left) and Adjusted model (right) AUC]

Bladder

[Figure: Bladder: Base model (left) and Adjusted model (right) AUC]

6.1.4 Prediction Performance

Lung

[Figure: Lung: Base model (left) and Adjusted model (right) AUC]

Bladder

[Figure: Bladder: Base model (left) and Adjusted model (right) AUC]

6.2 Forced (using penalty.factor)

6.2.1 Calibration

Lung

[Figure: Lung: Calibration of lambda and pi (Top = base model; Bottom = model adjusted for smoking)]

Bladder

[Figure: Bladder: Calibration of lambda and pi (Top = base model; Bottom = model adjusted for smoking)]

6.2.2 Mean Odds Ratios

Lung

[Figure: Lung: Mean Odds Ratio]

Bladder

[Figure: Bladder: Mean Odds Ratio]

6.2.3 Selection Proportion

Lung

[Figure: Lung: Selection Proportion]

Bladder

[Figure: Bladder: Selection Proportion]

Checking consistency in sign of the beta coefficients for the variables with high selprop

Lung

[Figure: Lung: Base model (left) and Adjusted model (right) AUC]

Bladder

[Figure: Bladder: Base model (left) and Adjusted model (right) AUC]

6.2.4 Prediction Performance

Lung

[Figure: Lung: Base model (left) and Adjusted model (right) AUC]

Bladder

[Figure: Bladder: Base model (left) and Adjusted model (right) AUC]

7 sPLS

7.1 Lung cancer

7.1.1 Calibration



[Figure: Stability analyses for sPLS on lung adjusted for age, sex and BMI]

[Figure: Stability analyses for sPLS on lung adjusted for age, sex, BMI and smoking]



7.1.2 Stability selection

Lambda = 36, proportion = 0.9

[Figure: Stability selection for sPLS on lung adjusted for age, sex, and BMI]

[Figure: Selection proportion for sPLS on lung adjusted for age, sex, and BMI]

Use results from stability selection for sPLS, lambda = 36

[Figure: Loading coefficients from sPLS on lung adjusted for age, sex, and BMI]



Lambda = 38, proportion = 0.9

[Figure: Stability selection for sPLS on lung adjusted for age, sex, BMI and smoking]

[Figure: Selection proportion for sPLS on lung adjusted for age, sex, BMI and smoking]

Use results from stability selection for sPLS, lambda = 38

[Figure: Loading coefficients from sPLS on lung adjusted for age, sex, BMI and smoking]

7.2 Bladder cancer

7.2.1 Calibration



[Figure: Stability analyses for sPLS on bladder adjusted for age, sex and BMI]

[Figure: Stability analysis for sPLS on bladder adjusted for age, sex, BMI and smoking]



7.2.2 Stability selection

Lambda = 22, proportion = 0.9

[Figure: Stability selection for sPLS on bladder adjusted for age, sex, and BMI]

[Figure: Selection proportion for sPLS on bladder adjusted for age, sex, and BMI]

Use results from stability selection for sPLS, lambda = 22

[Figure: Loading coefficients from sPLS on bladder adjusted for age, sex, and BMI]



Lambda = 26, proportion = 0.9

[Figure: Stability selection for sPLS on bladder adjusted for age, sex, BMI and smoking]

[Figure: Selection proportion for sPLS on bladder adjusted for age, sex, BMI and smoking]

Use results from stability selection for sPLS, lambda = 26

[Figure: Loading coefficients from sPLS on bladder adjusted for age, sex, BMI and smoking]

8 Discussion

8.1 LASSO

Lung cancer:

  • Selected variables attenuated after adjustment for smoking:
    • Rented accommodation
  • Selected variables strengthened after adjustment for smoking:
    • Townsend deprivation index
    • Average household income (31,000-51,999)
  • Selected for both models
    • High education attainment
    • Average household income (>52,000)
    • Parent history of COPD
    • Number of medications (>1)
    • Cardiovascular
    • Respiratory
    • Alkaline phosphatase
    • C reactive protein
    • Cholesterol

Bladder cancer:

  • Selected variables strengthened after adjustment for smoking:
    • Close to major road
  • Selected for both:
    • Rented accommodation
    • High education attainment
    • Parent history of COPD
    • Apolipoprotein A
    • Testosterone

Key points:

  • Sociodemographic factors associated with both lung and bladder cancer, especially education level.
  • Parental history of COPD associated with both lung and bladder cancer.
  • Bladder cancer: negative association with testosterone; are women more at risk?

Questions

  • Why is the LASSO model with forced variables more stringent than the denoised model?
  • How to report the three plots?
    • Selection proportion – order by variable groupings? or selection?
    • AUC as a function of number of predictors included in model – overlap models?

8.2 sPLS

Lung cancer:

  • No. of variables to be selected stably:
    • adjustment for age, sex, and BMI: 26
    • adjustment for age, sex, BMI and smoking: 23
  • Selected variables attenuated after adjustment for smoking:
    • Coffee ≥4 cups
    • Always add salt to food
    • Total protein
    • Urea
  • Selected variables strengthened after adjustment for smoking:
    • Parental history of lung cancer
    • Having hypertension
    • Apolipoprotein B
    • Glucose

Bladder cancer:

  • No. of variables to be selected stably:
    • adjustment for age, sex, and BMI: 11
    • adjustment for age, sex, BMI and smoking: 8
  • Selected variables attenuated after adjustment for smoking:
    • Always add salt to food
  • Selected variables strengthened after adjustment for smoking:
    • Eat processed meat more than once a week
    • PM 2.5 absorbance
    • PM 10
    • Parental history of CHD
    • Alanine aminotransferase
    • Calcium
    • Phosphate
    • SHBG
    • Urea

Question

  • selection proportion paths
  • gPLS

8.3 Plan

Report (Results section) outline:

  1. Descriptive statistics and univariate analysis
    1. Table 1
    2. Manhattan plots
    3. Forest plots
    4. Scatter plots
  2. Multivariate analysis:
    1. LASSO (OR, Selection proportion and AUC)
    2. gPLS, sgPLS (Loading coefficients and Selection proportion)
  3. Targeted analyses by lung cancer subtypes (if time allows)
    1. Run LASSO (?)
  4. Sensitivity analysis
    1. Stratify by time-to-diagnosis
    2. Stratify by age at diagnosis
  5. Contextualize findings, especially biomarker meanings

Next steps are in bold